Reliability and Validity of
Adaptive  and  Conventional  Tests 
in  a  Military  Recruit  Population 


Testing  theorists  have  proposed  a  number  of  adaptive  testing  strategies 
over  the  last  two  decades  (see  Weiss,  1974).  Although  mechanical  selection 
strategies  were  dominant  at  the  beginning  of  the  1970s,  they  have  now  been 
largely  replaced  by  item  selection  strategies  based  on  item  response  theory 
(IRT).  In  mechanical  item  selection  strategies,  items  are  selected  sequentially 
on  the  basis  of  their  position  in  a  structured  item  pool.  Hence,  at  any  point 
in  the  test,  only  certain  items  are  available  for  selection  and  presentation. 
IRT-based  item  selection  strategies  select  items  which  minimize  or  maximize  some 
mathematical  quantity.  Thus,  any  item  in  the  pool  is  potentially  available  for 
selection. The dominant mathematical item selection strategies are maximum
information and Owen's Bayesian procedure.

Maximum information item selection involves selecting, at each stage of an
adaptive test, the test item that provides the most psychometric information
at the examinee's current ability estimate. This testing strategy has been
used in a number of studies (Bejar & Weiss, 1978; Bejar, Weiss, & Gialluca,
1977; Prestwood & Weiss, 1978). It is preferred by some adaptive testing
researchers (e.g., Lord, 1976) because it does not make prior judgments as to
the distribution of ability in the population. However, others (e.g., Samejima,
1969; Urry, 1977) have claimed that maximum likelihood scoring procedures, which
are usually utilized in conjunction with maximum information item selection,
implicitly specify a flat prior distribution, and a flat prior distribution of
ability would seldom correspond to the actual distribution of ability in the
population. Additionally, maximum likelihood estimates of an individual's
ability level do not explicitly exist when that individual answers all items
correctly or all items incorrectly; and, occasionally, maximum likelihood
scoring can result in indeterminate ability estimates for an individual on
short tests.
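
The selection rule described above can be sketched as follows. This is a
minimal illustration, not the procedure used in the studies cited. It assumes
a three-parameter logistic item response function with the conventional
scaling constant D = 1.7; the function names and the three-item pool are
hypothetical.

```python
import math

D = 1.7  # assumed logistic scaling constant (approximates the normal ogive)

def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_information(theta, a, b, c):
    """Item information at theta for the three-parameter logistic model."""
    p = p_correct(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def select_max_info(theta_hat, pool, administered):
    """Index of the unused item with the highest information at theta_hat."""
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta_hat, *pool[i]))

# Hypothetical pool of (a, b, c) triples, for illustration only.
pool = [(1.2, -1.0, 0.12), (1.2, 0.0, 0.12), (1.2, 1.0, 0.12)]
next_item = select_max_info(0.1, pool, administered=set())
```

Because the three illustrative items share the same discrimination and
guessing parameters, the item whose difficulty lies closest to the current
ability estimate is the one selected.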

For these reasons some adaptive testing researchers combine maximum information
item selection with Bayesian scoring procedures. The Bayesian modal procedure
(Samejima, 1969) scores response patterns by using the mode of the posterior
ability distribution as the estimate of ability, where the initial prior
distribution is usually specified as having a mean of 0 and a standard deviation
of 1. Owen's (1969, 1975) Bayesian scoring method, which can be combined with
maximum information item selection (e.g., Brown & Weiss, 1977; Kingsbury &
Weiss, 1979), is similar to Bayesian modal procedures except that ability is
estimated by using the mean of the posterior ability distribution. Both Bayesian
scoring methods, however, require the assumption of a normal distribution of
ability. Owen's Bayesian scoring method, when combined with a Bayesian item
selection procedure, provides a fully Bayesian strategy for adaptive test
administration (Owen, 1969, 1975) in which items are selected at each stage of
the test to minimize the Bayesian posterior variance of the ability estimate.
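
The difference between modal and mean Bayesian scoring can be illustrated
numerically. The sketch below is not Owen's procedure, which updates a normal
approximation to the posterior in closed form; it substitutes a simple grid
approximation to show the posterior mode, mean, and variance under a standard
normal prior. The three-parameter logistic model, the constant D = 1.7, the
grid settings, and all names are assumptions made for illustration.

```python
import math

D = 1.7  # assumed logistic scaling constant

def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def posterior_summary(responses, grid_min=-4.0, grid_max=4.0, n_points=801):
    """Return (mean, mode, variance) of the posterior ability distribution.

    responses is a list of ((a, b, c), u) pairs, with u = 1 for a correct
    answer and u = 0 for an incorrect answer. The prior is N(0, 1).
    """
    step = (grid_max - grid_min) / (n_points - 1)
    grid = [grid_min + k * step for k in range(n_points)]
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) prior
        for (a, b, c), u in responses:
            p = p_correct(theta, a, b, c)
            w *= p if u == 1 else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    mode = grid[max(range(n_points), key=weights.__getitem__)]
    return mean, mode, var

# Hypothetical all-correct pattern on three identical items.
pattern = [((1.2, 0.0, 0.12), 1)] * 3
eap, mode, var = posterior_summary(pattern)
```

Both estimates remain finite for the all-correct pattern, the case in which a
maximum likelihood estimate does not exist.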

Research  on  Owen's  Bayesian  Adaptive  Testing  Strategy 

Simulation studies. Many simulation studies have shown that Owen's Bayesian
adaptive testing strategy results in stable, reliable, and valid scores even
for very short tests (Jensema, 1974, 1976; McBride, 1977; McBride & Weiss, 1976;
Urry, 1974). For example, Urry (1974) found that Owen's Bayesian strategy
achieved the reliability of a 60-item conventional test in 10 to 15 items.
Urry  (1977)  found  that  the  validity  of  scores  from  Owen's  Bayesian  procedure  for 
a  sample  of  57  live  examinees  was  higher  than  that  predicted  by  theory  and  by 
simulation  results.  However,  Urry  did  not  employ  any  other  testing  strategies 
that  could  be  used  for  comparison  with  the  Bayesian  strategy,  and  his  sample  was 
sufficiently  small  so  that  the  unexpectedly  high  validities  may  well  have  been  a 
sampling  artifact. 

Gorman  (1980)  compared  three  types  of  conventional  tests  (strongly  peaked, 
somewhat  peaked,  and  rectangular)  to  adaptive  tests  using  maximum  information 
item selection and Bayesian modal scoring and to Owen's Bayesian adaptive
testing strategy. Using both known and estimated item parameters, he found both
Bayesian procedures superior to any conventional procedure on all evaluation
criteria, which were (1) the fidelity coefficient (correlation of true and
estimated ability scores), (2) conditional bias (mean directional error of
ability estimates), (3) conditional accuracy (root mean square error of ability
estimates), and (4) conditional precision (derived from the test score information
function).  He  found  that  Owen's  Bayesian  procedure  provided  less  bias  using 
estimated  item  parameters  than  did  the  Bayesian  modal  adaptive  or  Bayesian- 
modally-scored  conventional  strategies.  Altogether  the  Owen  procedure  provided 
somewhat  better  psychometric  properties  than  the  Bayes  modal  procedure.  Gorman 
also  found  that  for  all  of  the  adaptive  tests  evaluated,  their  superiority  over 
conventional  tests  increased  as  a  function  of  item  discriminations. 
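
The first three evaluation criteria named above have simple computational
definitions when true abilities are known, as in a simulation. The following
sketch computes them for one set of simulees; for the conditional versions it
would be applied separately to simulees at each true ability level. The
function names and the three data points are hypothetical and are not taken
from Gorman (1980).

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def evaluate_estimates(true_theta, est_theta):
    """Return (fidelity, bias, rmse) for estimated versus true abilities."""
    errors = [e - t for t, e in zip(true_theta, est_theta)]
    bias = sum(errors) / len(errors)                             # mean signed error
    rmse = math.sqrt(sum(d * d for d in errors) / len(errors))   # accuracy
    fidelity = pearson_r(true_theta, est_theta)
    return fidelity, bias, rmse

# Illustrative values only.
fidelity, bias, rmse = evaluate_estimates([-1.0, 0.0, 1.0], [-0.8, 0.1, 1.3])
```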

Thus,  these  simulation  studies  have  shown  that  Owen's  Bayesian  adaptive 
procedure  achieves  specified  levels  of  measurement  precision  using  far  fewer 
items than conventional testing procedures and results in scores with
substantially higher reliability and validity than those from conventional tests of the
same  length. 

Live-testing studies. One of the first reported live-testing studies of
Owen's Bayesian adaptive testing strategy (Thompson & Weiss, 1980) was based on
a group of about 100 college undergraduates. The study compared the
criterion-related validity of the adaptive testing strategy with that of
conventional tests administered to another group of students. Correlations of
ability estimates with grade-point averages (GPA) were higher for the Bayesian
test than for the conventional test. Scores on the Bayesian test correlated
significantly higher with high school GPA (r = .51) than did the number-correct
score on the conventional test (r = .40), even though the median number of items
in the Bayesian test was 12.5% fewer than were administered in the conventional
test.

Kingsbury and Weiss (1980) reported the first large-scale investigation of
the performance of Owen's Bayesian strategy in live testing. They examined both
alternate forms reliability and concurrent validity of Owen's Bayesian strategy
in comparison with a conventional ability test. They administered to 472
college students a 120-item conventional criterion test scored by Bayesian
methods, two 30-item conventional tests, and two 30-item adaptively administered
Bayesian tests. The results were not completely in accord with theoretical
expectations. For tests of one and two items in length, the conventional
strategy was superior in parallel forms reliability; the adaptive tests achieved
higher reliabilities for test lengths of four to 30 items. However, the
conventional strategy achieved consistently higher validities than the Bayesian
adaptive strategy.

In  a  third  live-testing  study,  also  using  large  groups  of  college  students, 
Johnson and Weiss (1980) compared 30-item conventional, 30-item Bayesian
adaptive, and 30-item maximum information tests. They concluded that the alternate
forms  of  the  conventional  test  were  more  nearly  parallel  than  the  alternate 
forms  of  either  adaptive  strategy.  Parallel  forms  reliabilities  were  similar 
for  the  conventional  strategy  and  the  two  adaptive  strategies  for  tests  up  to 
about  10  items  in  length.  After  that  point,  conventional  test  reliabilities 
were  higher  than  those  of  the  adaptive  strategies. 

Three factors may have contributed to these unexpected results: (1) the
item pool had fewer items at the extremes of the ability distribution than near
the center of the ability distribution, and the items at the extremes were of
lower discrimination; (2) error in item parameter estimates may have been of
sufficient magnitude to degrade the effectiveness of the adaptive testing
strategies; (3) the range of the ability distribution in the college student
sample was small. Data presented by Johnson and Weiss (1980) suggest that
inadequacies in the item pool might have accounted for the failure of the
adaptive tests to perform in accordance with expectations. Their data on
conditional errors of measurement show that the standard error of measurement
(SEM), which was always lower for the adaptive tests, increased for the maximum
information strategy, especially at the lower end of the ability distribution,
in contrast to simulation studies which show essentially flat information (and,
therefore, SEM) functions. Since the SEM in an adaptive measurement is a joint
function of the discrimination of the items and the number of items near the
current estimated ability level, the combination of insufficient numbers of
items with relatively low levels of item discrimination toward the lower extreme
of the ability distribution might have resulted in the poorer performance of the
adaptive tests in comparison to the conventional test.

All three of these live-testing studies of the Bayesian adaptive strategy
were confounded by the small numbers of examinees on which the item parameters
were obtained and by nonoptimal item pools for the adaptive strategy. In
addition, because all studies were based on data from college students,
restrictions in the range of abilities in the population undoubtedly affected
the correlational results. Finally, in the Kingsbury and Weiss (1980) study,
method variance might have been partially responsible for the higher
correlations of the conventional experimental tests with the conventional
criterion tests.

McBride  (1980),  in  a  live-testing  pilot  study  on  which  the  present  study  is 
based,  found  that  Owen’s  Bayesian  procedure  produced  verbal  ability  scores  that 
were  more  reliable  and  valid  at  all  test  lengths  than  a  conventional  ability 
test.  Since  he  tested  Marine  recruits,  restriction  of  the  ability  range  should 
have  been  less  severe  than  in  the  case  of  the  college  population.  He  concluded 
from  his  data  that  a  fixed-length  adaptive  test  was  as  reliable  as  a  variable- 
length adaptive test, and that adaptive tests of about 10 items were
sufficiently reliable for military personnel testing purposes. This was the
first comparative live-data study that fulfilled theoretical expectations.


Research on Other Aspects of Adaptive Testing


An important aspect of computerized testing is how testing strategy is
related to the time it takes examinees to complete a test. It might be expected
that for items that are in the middle range of difficulty for an individual
examinee, response latencies (and, therefore, total testing time) would be
greater than for items that are much too easy or much too difficult for that
examinee. Since adaptive testing procedures select items for administration that
are near the ability level of the examinee, whereas the conventional strategy
does not, there may be differences in response latencies (or total testing time)
due to the testing strategy. Using ANOVA, Betz and Weiss (1976) compared mean
item latencies employing knowledge of results (KR), test type, and ability level
as the independent variables. Although latencies for the stradaptive tests were
slightly longer, differences were not statistically significant for test type
but were statistically significant for ability level. Waters (1977) found that
examinees responding to items in a stradaptive test required about 11% longer
(p < .05) to respond to each item than did examinees who took a conventional
test.


Johnson,  Weiss,  and  Prestwood  (1981)  also  found  that  items  on  stradaptive 
tests took examinees an average of 4% longer for fixed-length tests and an
average of 11% longer for variable-length tests in comparison with conventionally
administered  items.  They  also  noted  that  examinees  taking  the  conventional 
tests  more  frequently  reported  that  the  items  were  too  easy  or  too  difficult  for 
them,  in  comparison  with  those  taking  stradaptive  tests. 

Previously published research on computer-administered testing has not
addressed the important practical question of whether novices have problems in
learning to use the equipment. Such information is particularly important,
along with information on the length of time it takes examinees to learn to use
the equipment, in evaluations of the feasibility of adaptive testing in large
unselected populations.

Purpose 

The  present  study  was  undertaken  primarily  to  further  study  the  reliability 
and  validity  of  Bayesian  adaptive  tests,  in  comparison  with  conventional  tests, 
in  a  military  recruit  population.  Also  of  interest  were  a  comparison  of  the 
amount  of  time  required  for  administration  of  the  adaptive  and  conventional 
tests  and  an  evaluation  of  the  effectiveness  of  the  instructional  sequence  for 
this  population. 


METHOD 


Subjects 

Subjects were 553 male Marine recruits from the Marine Corps Recruit Depot
(MCRD) in San Diego, California. In contrast to the design of the
Kingsbury and Weiss (1980) alternate forms study, in the present study an
independent groups design was used in which recruits were sequentially assigned
to an adaptive or a conventional testing group. There were 263 recruits in the
adaptive test group and 267 in the conventional test group.


Procedures 


Testing  equipment.  Testing  was  controlled  by  a  Hewlett-Packard  real-time 
minicomputer  system  located  at  the  University  of  Minnesota  in  Minneapolis.  A 
multiplexed  leased  telephone  circuit  was  connected  to  four  cathode  ray  terminals 
(CRTs)  operating  at  120  characters  per  second  at  MCRD.  The  testing  room  was 
continually  monitored  by  a  test  proctor,  who  helped  the  recruits  become  familiar 
with  the  equipment,  answered  "proctor  calls"  generated  by  the  testing  system, 
and ensured that the equipment was operating satisfactorily.

Instructional sequence. Since the Marine recruit examinees were not
expected to be familiar with the operation of a CRT, a sequence of instructional
screens  was  presented  to  each  examinee  before  beginning  test  administration. 

The  15  primary  instructional  screens,  based  on  those  originally  described  by 
DeWitt and Weiss (1974) and used for several thousand test administrations
since,  are  shown  in  Appendix  Table  A.  The  instructional  screen  sequence 
assisted  the  recruits  in  learning  to  communicate  with  the  computer  by  requesting 
that  they  (1)  type  a  number  and  press  the  return  key,  (2)  type  "GO"  and  press 
the  return  key,  (3)  use  the  shift  key,  and  (4)  demonstrate  their  ability  to 
change a response that was already typed. Appropriate error sequences were
provided (see Appendix Tables A and B) to give examinees additional help when
needed. Repeated errors resulted in an audible proctor call; when this occurred,
the proctor intervened directly to assist the examinee in learning to use the
CRT terminal.

After  the  examinee  had  demonstrated  his  understanding  of  the  mechanics  of 
CRT  operation,  five  sample  verbal  ability  items  were  presented  to  familiarize 
him  with  the  item  types  and  formats  he  would  encounter  in  the  experimental  and 
criterion tests. Item types consisted of Sentence Completion, Synonyms,
Analogies, and Opposites. The sample items (see Appendix Table A) were chosen to
be very easy items that would be likely to be answered correctly by all
examinees. If an incorrect answer was given, the examinee was given a second
opportunity to answer the question; an incorrect answer the second time the
screen was presented led to a proctor call.

Item pool. The items consisted of the same 150 five-alternative multiple-choice
verbal ability items used by McBride (1980) in the pilot study. IRT parameters
for the items were estimated using Urry's (1976; Gugel, Schmidt, & Urry, 1976)
OGIVIA program, based on samples of 980 to 2,200 Marine recruits. All item
response function (IRF) discrimination parameters were greater than a = .80,
difficulties were approximately rectangularly distributed between b = +2 and
-2, and "guessing" parameters were less than c = .30. As Appendix Table C shows,
the mean discrimination parameter for the pool was a relatively high a = 1.24,
the mean difficulty was b = -.09, while the mean guessing parameter was
c = .12. The classical item parameters for these 150 items were a mean biserial
correlation of .76 and a mean difficulty of p = .57.

Tests 


Experimental tests. The conventional test consisted of two alternate
forms, each 30 items in length. Both conventional forms were administered on
the CRT at the same time. Items were presented from each form (Forms 1 and 2)
in the repeating order 12212112. The conventional tests were constructed to
have a rectangular distribution of item difficulties spanning the difficulty
range of the item pool (IRT parameters and classical item parameters for each
conventional test item are shown in Appendix Table D). Rectangular conventional
tests were employed to equalize measurement precision across ability levels and
to be similar to the verbal tests used in the Armed Services Vocational Aptitude
Battery. The two forms were constructed to be "weakly parallel" (Samejima,
1977), i.e., to have test information functions that were approximately equal.

To select the items for the conventional tests, the 150 items in the item
pool were sorted into five difficulty levels. Six items were selected in a
balanced way from each difficulty level for each form of the conventional test,
starting with the most discriminating items at each level. This design was used
so that more discriminating items would appear earlier in the test than less
discriminating items, thus allowing a more meaningful comparison with the
adaptive tests, which were expected to select the most discriminating items
toward the beginning of the test. This procedure resulted in mean
discriminations of a = 1.42 for Form 1 and a = 1.46 for Form 2, mean
difficulties of b = -.50 and b = -.32 for the two forms, respectively, and mean
"guessing" parameters of c = .11 for both forms (see Appendix Table D). Figure 1
shows the test information curves for Forms 1 and 2 of the conventional tests.
As can be seen, the test construction procedures resulted in very similar
information functions for the two forms, thus fulfilling Samejima's (1977)
weakly parallel criterion. The conventional tests were scored by number correct
at each test length from 1 to 30 items.
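
The form construction steps described above might be sketched as follows. The
text does not specify how the difficulty levels were bounded or what "balanced"
selection meant in detail, so this sketch makes two assumptions: the levels are
equal-width bands between b = -2 and b = +2, and the twelve most discriminating
items in each band are dealt to the two forms in an ABBA order. The function
name and the synthetic pool are hypothetical, and the ordering of items within
each form is not modeled.

```python
def build_forms(pool, n_levels=5, per_level=6, b_min=-2.0, b_max=2.0):
    """Split a pool of (item_id, a, b) tuples into two matched forms.

    Assumptions: equal-width difficulty bands and ABBA dealing of the most
    discriminating items within each band.
    """
    width = (b_max - b_min) / n_levels
    strata = [[] for _ in range(n_levels)]
    for item_id, a, b in pool:
        level = min(n_levels - 1, max(0, int((b - b_min) / width)))
        strata[level].append((item_id, a, b))

    form1, form2 = [], []
    for stratum in strata:
        # Most discriminating items first; keep enough for both forms.
        ranked = sorted(stratum, key=lambda it: it[1], reverse=True)
        for k, (item_id, _, _) in enumerate(ranked[:2 * per_level]):
            # ABBA dealing: positions 0 and 3 of every four go to Form 1.
            (form1 if k % 4 in (0, 3) else form2).append(item_id)
    return form1, form2

# Synthetic 150-item pool for illustration only (not the study's items).
demo_pool = [(i, 0.8 + (i % 7) * 0.1, -2.0 + 4.0 * i / 149) for i in range(150)]
form1, form2 = build_forms(demo_pool)
```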


Figure  1 

Test  Information  Functions  for  Forms  1  and  2 
of  the  30-Item  Conventional  Tests 


Administration of alternate forms of the tests to the adaptive test group
was similar to the procedure used with the conventional test group, with the
exception that items were selected by means of Owen's (1969, 1975) Bayesian
sequential adaptive testing procedure. For each of the two adaptive forms (Form
1 and Form 2) items were independently selected from the item pool in the
repeating order 12212112. To operationalize this procedure, as was done by
Kingsbury and Weiss (1980) and Johnson and Weiss (1980), one item was selected
from the pool as needed and assigned to Form 1 or Form 2 of the adaptive test
according to the 12212112 rotational scheme. This procedure was repeated after
each item was answered. As with the conventional tests, adaptive tests were 30
items in length, and no item was common to both forms for an individual
examinee. The adaptive tests were scored at test lengths from 1 to 30 items by
means of Bayesian ability estimates as an integral part of the test
administration procedure.
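
The 12212112 rotation can be generated mechanically. The short sketch below
(the function name is hypothetical) produces the form assignment for the 60
item presentations in a session and shows that each form receives 30 items.

```python
from itertools import cycle, islice

ROTATION = "12212112"  # repeating order of form assignments given in the text

def form_sequence(n_presentations=60):
    """Form number (1 or 2) for each successive item presentation."""
    return [int(ch) for ch in islice(cycle(ROTATION), n_presentations)]

sequence = form_sequence(60)
items_per_form = (sequence.count(1), sequence.count(2))
```

Within each block of eight presentations both forms occupy four positions, so
neither form is systematically administered earlier than the other.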

Criterion  test.  The  same  50-item  multiple-choice  conventional  test  was 
used  as  the  criterion  test  for  both  the  adaptive  and  conventional  test  groups. 
The  criterion  test  was  formed  by  selecting  items  measuring  word  knowledge  from 
obsolete forms of the ASVAB. This test contained four-alternative
multiple-choice items and was administered on the CRT immediately following
administration of the two 30-item experimental tests. The criterion test was scored by
number  correct. 


Data  Analysis 


Reliability  and  Validity 

Reliability. Following the analysis of Kingsbury and Weiss (1980), Johnson
and Weiss (1980), and McBride (1980), reliability was indexed by the
correlations between the scores on the alternate forms for tests of each length
(1 through 30 items). Because independent groups were used in the present study,
observed differences in reliability correlations between the two testing
strategies could be tested for statistical significance. After using Fisher's z
transformation on the correlations, z tests were computed for differences
between the reliabilities of the adaptive test forms and reliabilities of the
conventional test forms.
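
The test for a difference between two independent correlations can be sketched
as follows. This is the standard textbook form of the test and is assumed, not
confirmed, to match the computation used in the study. The example values are
the alternate forms correlations reported in the Results for tests of 1 item
and of 30 items, with the group sizes given for the adaptive and conventional
groups; the function name is hypothetical.

```python
import math

def fisher_z_difference(r1, n1, r2, n2):
    """z test for the difference between two independent correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)           # Fisher's z transformation
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))   # standard error of z1 - z2
    z = (z1 - z2) / se
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_tailed

# Adaptive versus conventional alternate forms correlations.
z_1, p_1 = fisher_z_difference(0.45, 263, 0.16, 267)    # 1-item tests
z_30, p_30 = fisher_z_difference(0.90, 263, 0.89, 267)  # 30-item tests
```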

One question of interest in the interpretation of these reliability
correlations is the degree to which the alternate forms of the two testing
strategies were truly parallel, since in the study by Kingsbury and Weiss (1980)
apparent differences were observed in the degree of parallelism for the adaptive
and conventional tests. To answer this question, correlated means t tests were
computed to examine differences in means and standard deviations of scores of
each test length for both testing strategies.

Validity. Scores from forms of every length were correlated with total
number-correct scores from the 50-item criterion test, separately for the
adaptive and conventional tests, and for Forms 1 and 2 of each test. All
possible pairwise comparisons between the adaptive and conventional tests of
correlations with the criterion test, for forms of the same length, were tested
for differences using t tests. Since the criterion test was the same for both
groups, and other sources of variation were controlled, any differences in
validities between the testing strategies were due to the testing strategies or
to sampling error in the sampling of examinees or abilities.


As is well known, validity is reduced by the unreliability of the measures
employed. The correction for attenuation results in a validity coefficient with
the effects of unreliability removed. Consequently, attenuation-corrected
validity coefficients were computed for tests of lengths 5, 10, 15, 20, 25, and
30 items. Reliability was assessed for the criterion test by means of
coefficient alpha, and parallel forms reliability was used for the experimental
tests.
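
The correction for attenuation divides the observed validity by the square
root of the product of the two reliabilities. A minimal sketch is given below;
the function name is hypothetical, and the numerical values are illustrative
only and are not results from this study.

```python
import math

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Validity coefficient corrected for unreliability in both measures.

    r_xy: observed test-criterion correlation
    r_xx: reliability of the experimental test (e.g., parallel forms)
    r_yy: reliability of the criterion test (e.g., coefficient alpha)
    """
    if not (0.0 < r_xx <= 1.0 and 0.0 < r_yy <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)

# Illustrative values, not taken from the study.
example = correct_for_attenuation(0.60, 0.80, 0.90)
```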

Comparisons  of  item  characteristics.  Previous  research  comparing  adaptive 
and  conventional  tests  (e.g.,  Kingsbury  &  Weiss,  1980;  Thompson  &  Weiss,  1980) 
has frequently used independent item pools for each testing strategy, thus
rendering comparisons of the results difficult since observed differences in
reliabilities and/or validities may be due to differing item discriminations used
for  the  different  testing  strategies.  Even  when  the  same  item  pool  has  been 
used  in  independent  groups  (e.g.,  Johnson  &  Weiss,  1980;  McBride,  1980),  the 
higher  reliabilities  and/or  validities  for  the  adaptive  test  may  be  a  result  of 
their  selection  of  the  most  discriminating  items  in  the  pool,  resulting  in 
scores  based  on  more  discriminating  items  than  for  the  conventional  tests. 

To determine whether this occurred with the present data, means and
standard deviations of the item parameter estimates for the conventional tests were
compared  with  those  for  the  adaptive  tests  based  on  items  actually  administered 
by  the  adaptive  procedure.  Thus,  item  parameter  descriptive  statistics  were 
computed  prior  to  testing  for  the  conventional  test  forms,  but  were  computed 
after  the  data  were  collected  for  the  adaptive  forms. 

Testing  Time 

To compare the amounts of testing time required by conventional and
adaptive tests, cumulative item response latencies in seconds (i.e., total
testing time excluding instructions) were analyzed using two-way analysis of
variance with four levels of ability and the two testing strategies as the
independent variables. Ability levels were arbitrarily defined such that Level 1
included examinees of estimated ability below θ = -1.0, Level 2 between
θ = -1.0 and 0, Level 3 between θ = 0 and 1.0, and Level 4 above θ = 1.0.
Ability levels in the conventional test group were defined so as to make the
distribution of examinees in the four levels as similar to that of the adaptive
test group as possible. Separate analyses were performed for test lengths of 5,
10, 15, 20, 25, and 30 items.
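
The assignment of adaptive test examinees to the four ability levels can be
sketched as a simple binning step. The text does not say how estimates falling
exactly on a cut point were handled; this sketch assumes they are placed in the
higher level. The names and the sample estimates are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

CUTS = [-1.0, 0.0, 1.0]  # boundaries between Levels 1-2, 2-3, and 3-4

def ability_level(theta_hat):
    """Ability level (1 to 4) for a Bayesian ability estimate.

    Assumption: an estimate equal to a cut point goes to the higher level.
    """
    return bisect_right(CUTS, theta_hat) + 1

# Illustrative estimates only.
estimates = [-1.5, -0.5, 0.5, 1.5, 0.2]
level_counts = Counter(ability_level(t) for t in estimates)
```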

RESULTS 

Reliability 

Alternate  Forms  Correlations 

Alternate forms correlations were computed using scores on the two forms of
the Bayesian adaptive tests and on the two forms of the conventional tests, as a
function of test length; these data are plotted in Figure 2 (numerical values
are shown in Appendix Tables E and F). As Figure 2 shows, the Bayesian scores
for the two adaptive tests correlated .45 after one item, increased rapidly to
.78 after 7 items, then increased more slowly to .90 after all 30 items were
administered. The scores on the two forms of the conventional test correlated
.16 after one item, dropped to .13 after the second item, increased to .76 after
12 items and then more slowly to .89 after all 30 items were administered.
After using Fisher's z transformation, z tests for differences between the
reliability correlations were computed. These z tests show that for each test
length up to 19 items (i.e., values to the left of the vertical dashed line in
Figure 2) the adaptive forms correlated significantly higher (p ≤ .05) with each
other than did the conventional forms. Also, for all test lengths, the
alternate forms reliabilities of the adaptive tests were higher than the
reliabilities of the conventional tests. The horizontal dashed line in Figure 2
also shows that the adaptive test required only 9 items to achieve the same
alternate forms reliability (.80) as a 17-item conventional test.


Figure 2

Alternate Forms Reliability Correlations for the Adaptive (N = 263)
and Conventional (N = 267) Tests, as a Function of
the Number of Items Administered


Parallelism  of  the  Alternate  Forms 

Adaptive  tests.  Means,  variances,  skewness,  and  kurtosis  statistics  for 
the  scores  on  the  two  forms  of  the  Bayesian  adaptive  test  are  listed  in  Appendix 
Table  E.  Figure  3  shows  the  mean  scores  for  the  two  forms  of  the  adaptive  test; 
after  the  first  item  the  mean  Bayesian  score  for  Form  1  was  -.05,  and  for  Form 
2,  it  was  -.18.  Mean  scores  for  both  forms  rose  until  the  5th  item  for  Form  1 
and  the  8th  for  Form  2,  after  which  they  were  fairly  stable.  After  the  18th 
item  there  is  a  pronounced  trend  for  the  scores  from  the  two  forms  to  further 
converge.  After  all  30  items  were  administered,  the  mean  score  on  Form  1  was 
.06, whereas the mean score on Form 2 was .04. Scores were statistically
significantly (p ≤ .05) different, using correlated means t tests, only for
tests of one item and four items in length. At all other lengths, the adaptive
forms showed no significant (p ≤ .05) differences in Bayesian ability estimates
between the two test forms.

Figure 3

Mean Bayesian Ability Estimates for Forms 1 and 2 of the
Adaptive Test, as a Function of Number of Items Administered


A somewhat similar pattern is seen in the standard deviations of the Bayesian
adaptive test scores for the two forms (Figure 4; numerical data are in Appendix
Table E). SDs after one item were .63 and .64, respectively, rising
quickly to .84 after five items. Tests of lengths from 6 to 30 items had score
SDs slowly increasing to .90 and .88 for Form 1 and Form 2, respectively. Unlike
the means, which tended to converge, the SDs showed a slight divergence
with increasing test length. However, using a correlated variances test
(McNemar, 1969, p. 282), none of the differences in variances between the alternate
forms were statistically significant at any of the test lengths.
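
A test for the difference between two correlated variances can be sketched as
follows. The formula used here is the common t form for paired variances (often
attributed to Pitman and Morgan); it is assumed, not confirmed, that this is the
form given on the cited page of McNemar (1969). The function name and sample
data are hypothetical.

```python
import math
import statistics


def correlated_variances_t(x, y):
    """t statistic for the difference between two correlated variances.

    Uses t = (v1 - v2) * sqrt(n - 2) / (2 * s1 * s2 * sqrt(1 - r**2)),
    with n - 2 degrees of freedom, where v1 and v2 are the sample
    variances and r is the correlation between the paired scores.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equal-length lists with at least 3 scores")
    v1, v2 = statistics.variance(x), statistics.variance(y)
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    r = cov / math.sqrt(v1 * v2)
    denom = 2.0 * math.sqrt(v1 * v2) * math.sqrt(1.0 - r * r)
    return (v1 - v2) * math.sqrt(n - 2) / denom, n - 2


# Hypothetical paired scores on two forms.
t_value, df = correlated_variances_t([1, 2, 3, 4, 5], [2, 1, 4, 3, 6])
print(f"t = {t_value:.3f}, df = {df}")
```

The sign of t shows which form has the larger variance. The sketch does not
handle perfectly correlated scores, for which the denominator is zero.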


Mean Bayesian posterior variances were highly similar for the two forms for
all test lengths, as shown in Table 1. Mean posterior variances after the first
item were .59 for Form 1 and .60 for Form 2 and decreased smoothly to .05 for
both forms after 25 items.

Figure 4

Standard Deviations of Bayesian Ability Estimates for Forms
1 and 2 of the Adaptive Test, as a Function of Number of
Items Administered


Conventional test. Figures 5 and 6 (and Appendix Table F) show data pertaining
to the parallelism of the conventional test. Figure 5 shows that the
mean proportion-correct score on the conventional test after 30 items
was .65 for Form 1 and .64 for Form 2. Correlated means t tests for score
differences between mean number-correct scores on the two alternate forms of the
conventional test (see Appendix Table F) showed that the means of the conventional
forms were significantly different for 29 of the 30 t tests at a significance
level of .05; of these 29, 27 were significantly different at p ≤ .001.
There was no significant difference in mean number-correct scores only at
a test length of 14 items. Thus, although the two forms of the conventional
test were designed to be weakly parallel (see Figure 1), their mean scores did
not meet the classical definition of parallel tests. Unlike the results for the
Bayesian adaptive forms, there was little tendency toward score convergence for
the two conventional forms with increasing test length, as mean absolute t values
remained high through a test length of 29 items.

Figure 6 plots the number-correct standard deviations for the two conventional
forms with increasing numbers of items. Form 2 showed somewhat greater
standard deviations at almost all test lengths; however, after 28 items the
standard deviations converged to .17. In contrast to the adaptive tests, significant
differences in variances of the alternate forms were observed at 22 of
the test lengths examined. With the exception of two-item tests, the conventional
alternate forms had statistically significant differences in variances at
test lengths through 23 items.


Table 1

Means and Standard Deviations of
Bayesian Posterior Variances for
the Two Forms of the Adaptive Tests

Test         Form 1           Form 2
Length    Mean      SD     Mean      SD
   1       .59    .045      .60    .052
   2       .40    .055      .40    .052
   3       .30    .046      .30    .041
   4       .24    .037      .23    .024
   5       .20    .032      .20    .021
   6       .17    .025      .17    .019
   7       .15    .022      .15    .017
   8       .13    .019      .13    .015
   9       .12    .017      .12    .013
  10       .11    .014      .11    .011
  11       .10    .013      .10    .010
  12       .09    .012      .09    .010
  13       .09    .011      .09    .009
  14       .08    .010      .08    .009
  15       .08    .010      .08    .008
  16       .07    .009      .07    .008
  17       .07    .009      .07    .007
  18       .07    .008      .07    .007
  19       .06    .008      .06    .007
  20       .06    .008      .06    .007
  21       .06    .007      .06    .006
  22       .06    .007      .06    .006
  23       .06    .007      .06    .006
  24       .06    .007      .06    .006
  25       .05    .007      .05    .006
  26       .05    .006      .05    .006
  27       .05    .006      .05    .006
  28       .05    .006      .05    .006
  29       .05    .006      .05    .006
  30       .05    .006      .05    .006


Validity 

Correlation  with  Criterion  Test  Scores 


Scores from each form of both the adaptive and conventional tests at
lengths from one to 30 items were correlated with number-correct scores on the
50-item criterion test. These correlations are plotted in Figure 7; numerical
data are in Appendix Table G. Both adaptive forms correlated .39 with criterion
test scores after one item was administered, rising to .84 after all 30 items
were administered. Scores on the two forms of the conventional tests correlated
.28 and .31, respectively, with criterion test scores after one item and .80 and
.81, respectively, after 30 items were administered. As shown by the dashed
horizontal line in Figure 7, scores on Forms 1 and 2 of the adaptive tests
correlated .80 with scores on the criterion test after 10 items and 11 items,
respectively, whereas scores on the two forms of the conventional test required 30
items and 28 items, respectively, to achieve the same level of validity.


Figure 5

Mean Proportion-Correct Scores for Forms 1 and 2 of the
Conventional Test, as a Function of Number of Items Administered

Appendix Table G also shows results of the pairwise comparisons (between
forms of the adaptive and conventional tests) of the correlations with criterion
test scores. In all 120 comparisons, scores on the adaptive tests correlated
more highly with scores on the criterion test than did scores on the conventional
test. Although some of the differences were slight, 43 of them were sufficiently
large to be statistically significant at the .05 level. Most of the
significant differences occurred at test lengths of 18 items or less.
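
The report does not state which procedure was used for these pairwise
comparisons. One common choice for correlations obtained from independent
groups is Fisher's r-to-z transformation, sketched below as an assumption. The
group sizes in the example are borrowed from the Figure 2 caption purely for
illustration, and the function name is hypothetical.

```python
import math


def fisher_z_difference(r1, n1, r2, n2):
    """z test for the difference between two independent correlations.

    Each r is transformed with atanh (Fisher's z); the difference is
    divided by sqrt(1/(n1 - 3) + 1/(n2 - 3)).
    Returns (z, two_sided_p) under the normal approximation.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p value
    return z, p


# Illustrative inputs: validities of .84 and .80 with assumed group sizes.
z, p = fisher_z_difference(0.84, 263, 0.80, 267)
print(f"z = {z:.2f}, p = {p:.3f}")
```

Because the transformation stretches correlations near 1.0, a given raw
difference counts for more when both correlations are high.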

Attenuation-Corrected  Correlations 


Table 2 shows validity correlations from Appendix Table G for tests of
length 5, 10, 15, 20, 25, and 30 items that have been corrected for attenuation
caused by imperfect reliability in both the experimental tests and the criterion
test. Alpha reliability for the 50-item criterion test was .85 in both the
adaptive and conventional test groups; for the experimental
tests of a given length, alternate forms reliabilities were used (Appendix Tables
E and F). Overall, scores on the Bayesian adaptive tests showed higher
attenuation-corrected validity correlations than did scores on the conventional
tests. (The corrected correlation of 1.07 for Form 2 of the conventional test
at five items was a result of sampling artifacts.) For example, at the 15-item
test length, average corrected validities for the adaptive tests were .97; those
for the conventional tests averaged .92. At 25 items, average validities were
.97 for the adaptive tests and .915 for the conventional tests. The implication
of these corrected correlation coefficients seems to be that the ability dimension
that was measured by the criterion test was more nearly identical to that
measured by the Bayesian adaptive tests than by number-correct scores on the
conventional test, i.e., Bayesian adaptive scores contained less error and specific
variance than did number-correct scores on conventional tests of the same
length.


Figure 6

Standard Deviations of Proportion-Correct Scores for
Forms 1 and 2 of the Conventional Test, as a Function of
Number of Items Administered

[Figure 6 plot; horizontal axis: Number of Items Administered]
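
The correction for attenuation divides an observed correlation by the square
root of the product of the two reliabilities. The sketch below assumes this
classical form; the function name and the example inputs other than the
criterion alpha of .85 are hypothetical.

```python
import math


def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Disattenuated correlation: r_xy / sqrt(rel_x * rel_y)."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must be in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical validity of .80 and test reliability of .80, combined
# with the criterion alpha of .85 reported in the text.
corrected = correct_for_attenuation(0.80, 0.80, 0.85)
print(round(corrected, 3))

# A low reliability estimate paired with a moderate observed correlation
# (both hypothetical) pushes the corrected value above 1.0.
inflated = correct_for_attenuation(0.60, 0.35, 0.85)
print(round(inflated, 3))
```

The second call shows how a corrected value can exceed 1.0 when a reliability
estimate from a very short test is low relative to the observed validity.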


Figure 7

Validity Correlations with the Criterion Test for Two Forms
of the Adaptive Test and Two Forms of the Conventional Test,
as a Function of Number of Items Administered

[Figure 7 plot; vertical axis: Correlation With Criterion]


Table 2

Validity Correlations Corrected for
Attenuation for Forms 1 and 2 of
the Adaptive and Conventional Tests,
as a Function of Test Length

Test          Adaptive          Conventional
Length    Form 1   Form 2     Form 1   Form 2
   5        .93      .91        .90     1.07
  10        .96      .95        .93      .95
  15        .97      .97        .89      .95
  20        .97      .97        .92      .93
  25        .97      .97        .91      .92
  30        .96      .96        .92      .93


Characteristics  of  Items  Administered 

Item parameter means for each form of the adaptive and conventional tests
are given in Table 3. The mean discrimination (a) parameters for items actually
administered, averaged across examinees, for the two adaptive test forms were a =
1.33 and 1.32, respectively; for the conventional forms the mean a was 1.42 and
1.46, respectively. Thus, on the average, the conventional tests administered
more discriminating items than did the adaptive tests. Table 3 also shows that
the mean difficulties of the items administered in the adaptive tests were b =
.06 for Form 1 and b = -.15 for Form 2, whereas those of the conventional test
were b = -.50 and -.32. Thus, the adaptive tests were, on the average, more
difficult than the conventional tests, but their difficulty was closer to the
mean for the population on which the items were calibrated. All four tests
administered items with mean c = .11.


Table 3

Means and Standard Deviations of the Item Parameters
for the Adaptive and Conventional Forms

                    Adaptive                 Conventional
               Form 1       Form 2       Form 1       Form 2
Parameter    Mean    SD   Mean    SD   Mean    SD   Mean    SD
a            1.33   .34   1.32   .35   1.42   .49   1.46   .40
b             .06   .90   -.15   .93   -.50  1.16   -.32  1.22
c             .11   [?]    .11   .05    .11   .06    .11   .07


Table 4 contains means and standard deviations for the discrimination (a)
parameter for each sequential position of the adaptive and conventional test.
Mean a values were high in the early part of the adaptive test but decreased
steadily with increasing test length. The highest mean a (1.984) occurred in
the third sequential position, while the lowest (1.077) occurred in the 30th and
last sequential position. Thus, the adaptive test used the "best" items in the
pool early in the test and, as test length increased, used items of lower
discrimination. The pattern was similar but not as smooth for the conventional
test, where more highly discriminating items tended to occur earlier in the
test. For 20 of the 30 test lengths, the mean item discrimination for the conventional
test was higher than that of the adaptive test.
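
The practical effect of declining discrimination can be illustrated with the
item information function. The sketch below assumes a three-parameter logistic
model with the scaling constant D = 1.7, which this section of the report does
not restate; the function names and the difficulty value b = 0 are hypothetical,
and the two a values are the highest and lowest position means quoted above.

```python
import math

D = 1.7  # scaling constant commonly used with logistic IRT models (assumed)


def p_correct(theta, a, b, c):
    """Probability of a correct response under the assumed 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Item information at theta for a 3PL item."""
    p = p_correct(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


# Highest and lowest mean discriminations quoted for the adaptive test,
# paired with a hypothetical difficulty (b = 0) and the mean c of .11.
early = item_information(0.0, 1.984, 0.0, 0.11)
late = item_information(0.0, 1.077, 0.0, 0.11)
print(f"information ratio, early to late: {early / late:.2f}")
```

At theta equal to b, information is proportional to the square of a, so an item
with a near 2.0 contributes more than three times the information of an item
with a near 1.1 under these assumptions.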

Additional  Results 

Testing  Time 


Table 5 presents cumulative item response latencies (i.e., net testing
time) in minutes for each form of the adaptive and conventional tests, for each
of four ability levels. The adaptive tests consistently resulted in higher mean
net testing times than did the conventional tests for the highest ability level
group (Level 4). Examinees in the lower half of the ability distribution showed
no differences in mean net testing times between adaptive and conventional tests
for tests of any length. As tests became longer, differences in mean net testing
time increased substantially for examinees of high ability and increased

Table 4

Means and Standard Deviations
of the Discrimination (a)
Parameter for Each Sequential
Position of the Adaptive and
Conventional Tests for Both
Forms Combined

Test          Adaptive       Conventional
Length      Mean      SD         Mean
   1       1.470    .050         1.41
   2       1.715     [?]         2.52
   3       1.984    .788         1.58
   4       1.660    .398         1.80
   5       1.579    .315         2.05
   6       1.535    .262         1.42
   7       1.534    .369         1.60
   8       1.458    .270         1.59
   9       1.427    .253         2.29
  10       1.400    .277         2.29
  11       1.354    .242         1.40
  12       1.336    .230         1.45
  13       1.330    .237         1.29
  14       1.289    .184         1.79
  15       1.273    .199         1.28
  16       1.244    .212         1.26
  17       1.221    .175         1.18
  18       1.201    .171         1.46
  19       1.204    .177         1.40
  20       1.194    .217         1.40
  21       1.179    .225         1.37
  22       1.169    .220         1.16
  23       1.158    .231         1.38
  24       1.150    .220         1.20
  25       1.132    .205         1.21
  26       1.123    .202          .84
  27       1.089    .175         1.29
  28       1.095    .192         1.14
  29       1.088    .187         1.04
  30       1.077    .169         1.17

Note. Standard deviations are not
presented for the conventional
group, since they are based on only
two values, one from each form.

somewhat less for examinees of moderately high ability, in favor of the conventional
test condition; at the 30-item length, examinees on the adaptive test at
the highest level of ability required about 75% more time to respond to the
items, on the average, than did examinees on the conventional test. For the
combined ability groups at the 30-item length the adaptive test group required
17% longer to respond to the items. Net testing time differences were more pronounced
with increasing test length, since for the combined group at the 5-item
length there were essentially no differences, and at the 10-item length adaptive
tests took only 8% longer. For the conventional test group, net testing time
was strongly related to ability; as ability level increased, net testing time
decreased. This was not the case in the adaptive test group, where mean net
testing times tended to be greater for the highest and lowest ability levels and
somewhat less for middle ability levels.


Table 5

Means and Standard Deviations of Total Response Latencies in Minutes
(Net Testing Time) for Tests of Lengths 5, 10, 15, 20, 25, and 30 Items
for Two Forms of the Adaptive and Conventional Tests
at Four Levels of Ability and for Combined Ability Groups

Test Length               Adaptive                        Conventional
and Ability       Form 1            Form 2            Form 1            Form 2
Level*          N   Mean    SD    N   Mean    SD    N   Mean    SD    N   Mean    SD

5 Items
  1            30   2.80  2.27   35   2.44  1.01   47   2.53   .83   35   2.66  1.04
  2            81   1.95   .94   85   1.84  1.17  101   1.96  1.02   55   2.24   .89
  3           110   1.78   .76  111   1.90   .87   86   1.70   .69  153   1.93   .94
  4            30   2.41  1.10   20   2.03  1.03   17   1.23   .80    8   1.45   .65
  Combined    251   2.03  1.18  251   1.97  1.03  251   1.93   .93  251   2.08   .97

10 Items
  1            29   4.72  2.81   32   4.53  2.60   21   4.71  1.28   37   4.48  1.44
  2            89   3.77  1.54   86   3.74  1.28   74   4.18  1.91   77   4.48  2.04
  3            97   3.89  1.62  103   3.71  1.42  110   3.20  1.08   97   3.53  1.22
  4            36   4.06  1.90   30   4.09  1.46   46   2.49  1.12   40   2.53   .92
  Combined    251   3.97  1.82  251   3.87  1.60  251   3.48  1.56  251   3.80  1.66

15 Items
  1            32   6.77  3.70   36   6.18  3.19   23   6.57  1.93   43   6.30  1.90
  2            85   5.48  1.91   83   5.80  1.96   84   5.68  2.23   69   5.96  2.73
  3           102   5.66  1.93  103   5.37  1.80  105   4.52  1.43  101   5.15  1.83
  4            32   6.42  2.76   29   6.17  2.13   39   3.38  1.37   38   3.70  1.22
  Combined    251   5.84  2.36  251   5.72  2.15  251   4.92  2.00  251   5.35  2.21

20 Items
  1            34   8.56  4.73   34   8.02  4.03   27   8.65  2.84   36   7.88  2.69
  2            87   7.32  2.44   80   7.50  2.57   99   7.24  2.57   91   7.89  3.16
  3            98   7.76  2.58  108   7.25  2.29   92   5.75  1.83  100   6.57  2.19
  4            32   7.74  3.17   29   8.14  3.12   33   3.97  1.10   24   4.37  1.11
  Combined    251   7.71  3.00  251   7.54  2.77  251   6.41  2.56  251   7.02  2.78

25 Items
  1            36  10.17  5.33   35   9.74  4.66   32  10.30  3.48   32   9.33  3.29
  2            85   9.11  2.82   77   9.06  3.09   81   9.46  3.31   73   9.36  3.68
  3            99   9.43  3.10  107   9.05  2.77  110   7.39  2.28  111   7.78  2.65
  4            31   9.71  3.58   32   9.86  3.71   28   5.00  1.36   35   5.93  1.94
  Combined    251   9.46  3.47  251   9.25  3.30  251   8.16  3.16  251   8.18  3.19

30 Items
  1            34  12.21  6.32   35  11.32  5.25   36  12.41  4.72   30  12.48  5.10
  2            88  10.57  3.27   82  10.63  3.50   82  10.39  3.17   77  10.15  3.37
  3            96  11.10  3.53  101  10.63  3.19  100   8.38  2.58  109   8.84  2.94
  4            33  11.48  3.89   33  11.74  4.09   33   6.29  1.86   35   6.82  2.03
  Combined    251  11.11  3.99  251  10.87  3.76  251   9.34  3.57  251   9.40  3.62

*For the adaptive test, Level 1 = θ ≤ -2.0; Level 2 = -2.0 ≤ θ < 0.0;
Level 3 = 0.0 ≤ θ < 1.0; and Level 4 = θ > 1.0. For the conventional
tests, the score distributions were approximately matched to those of
the adaptive tests.
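
The percentage differences in net testing time quoted in the text can be
recomputed from the Table 5 means. The sketch below assumes that the two forms
within each testing strategy are simply averaged, which the report does not
state explicitly; the function name is hypothetical.

```python
def percent_longer(adaptive_means, conventional_means):
    """Percentage by which the average adaptive time exceeds the conventional."""
    adaptive = sum(adaptive_means) / len(adaptive_means)
    conventional = sum(conventional_means) / len(conventional_means)
    return 100.0 * (adaptive / conventional - 1.0)


# Mean net testing times in minutes (Form 1, Form 2) from Table 5.
level4_30 = percent_longer([11.48, 11.74], [6.29, 6.82])    # Level 4, 30 items
combined_30 = percent_longer([11.11, 10.87], [9.34, 9.40])  # combined, 30 items
combined_10 = percent_longer([3.97, 3.87], [3.48, 3.80])    # combined, 10 items
print(f"{level4_30:.1f}% {combined_30:.1f}% {combined_10:.1f}%")
```

Under this averaging assumption the results are about 77%, 17%, and 8%, close
to the rounded figures given in the text.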


Table 6

Two-way Analysis of Variance of Net Testing Time
by Ability Level and Testing Strategy (for Data in Table 5)

Test Length               Form 1                     Form 2
and Effect         DF      MS      F      p   DF      MS      F      p

5 Items
  Ability (A)       3    15.2   15.0   .001    3     7.9    8.3   .001
  Strategy (S)      1     3.6    3.5   .061    1     2.0    2.1   .145
  A x S             3     4.4    4.4   .005    3     2.1    2.2   .088
  Residual        494     1.0                494     1.0
  Total           501     1.1                501     1.0

10 Items
  Ability (A)       3    27.2   10.4   .001    3    26.1   10.9   .001
  Strategy (S)      1    21.0    8.0   .005    1      .4     .2   .692
  A x S             3    20.0    7.6   .001    3    21.8    9.1   .001
  Residual        494     2.6                494     2.4
  Total           501     2.9                501     2.7

15 Items
  Ability (A)       3    44.1   10.3   .001    3    36.5    8.3   .001
  Strategy (S)      1    91.1   21.2   .001    1    15.5    3.5   .062
  A x S             3    47.1   11.0   .001    3    29.5    6.7   .001
  Residual        494     4.3                494     4.4
  Total           501     5.0                501     4.8

20 Items
  Ability (A)       3    86.0   12.5   .001    3    44.4    6.2   .001
  Strategy (S)      1   203.0   29.4   .001    1    40.3    5.6   .018
  A x S             3    73.4   10.6   .001    3    59.2    8.3   .001
  Residual        494     6.9                494     7.2
  Total           501     8.2                501     7.8

25 Items
  Ability (A)       3   103.8   10.6   .001    3    50.0    5.0   .002
  Strategy (S)      1   200.9   20.5   .001    1   136.4   13.7   .001
  A x S             3   116.0   11.8   .001    3    71.9    7.2   .001
  Residual        494     9.8                494     9.9
  Total           501    11.4                501    10.8

30 Items
  Ability (A)       3   162.5   12.8   .001    3    93.2    7.5   .001
  Strategy (S)      1   396.0   31.1   .001    1   251.9   20.2   .001
  A x S             3   137.1   10.8   .001    3   119.1    9.5   .001
  Residual        494    12.7                494    12.5
  Total           501    15.1                501    14.1

Table 6 presents two-way ANOVA results for the data in Table 5. At the
5-item length the only main effect that was significant for both forms was ability
level; at the 10- and 15-item lengths, Form 1 additionally showed a significant
main effect for testing strategy, but this was not a significant main effect
for Form 2. For tests longer than 15 items both ability level and testing
strategy were significant (p ≤ .06). For all test lengths (except Form 2 at 5
items) the ability level by testing strategy interaction was significant.
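
Each F ratio in a fixed-effects analysis of variance of this kind is the mean
square for the effect divided by the residual mean square. As a hedged
illustration, the sketch below recomputes the 30-item Form 1 F ratios from the
rounded mean squares in Table 6; the function and variable names are
hypothetical.

```python
def f_ratio(ms_effect, ms_residual):
    """F statistic: effect mean square over residual mean square."""
    return ms_effect / ms_residual


# 30-item, Form 1 rows of Table 6: effect -> (MS, reported F).
MS_RESIDUAL = 12.7
rows = {
    "Ability (A)": (162.5, 12.8),
    "Strategy (S)": (396.0, 31.1),
    "A x S": (137.1, 10.8),
}
for effect, (ms, reported_f) in rows.items():
    computed = f_ratio(ms, MS_RESIDUAL)
    print(f"{effect}: computed F = {computed:.2f}, reported F = {reported_f}")
```

Small differences between computed and reported values are to be expected,
since the tabled mean squares are rounded to one decimal place.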

Effectiveness  of  the  Instructions 


Table 7 shows for each instructional screen the number of times each error
screen was presented. (Instructional screens are in Appendix Table A; error
screens are in Appendix Table B.) As can be seen in Table 7, examinees had the
greatest difficulty when they were required to use the "SHIFT" key. Instructional
Screen 9985 required examinees to change a typed "5" to a "4," which required
the use of both the "SHIFT" key and the "RUB(out)" key; this screen resulted
in 217 errors. Instructional Screen 9987, which required typing a question
mark (again requiring the "SHIFT" key), resulted in 122 errors. Otherwise,
there were only scattered errors, mostly in response to the five sample verbal
test items (Screens 9212, 9215, 9217, 9219, 9222). The five sample items resulted
in 188 errors altogether. An unknown hardware or software problem caused
Error Screen 9213 to be presented 16 times in response to Instructional Screen
9211, for which the proper Error Screen was 9902.

Error screens could also occur in response to other error screens. Appendix
Table H gives a similar breakdown of these errors. Altogether there were
161 such errors, with 76 of them resulting from Screen 9904 (second attempt to
change a response). Since errors resulting in Screens 9060 and 9061 were proctor
errors, only 131 of these errors were made by the recruits.

Table 8

Number of Error Screens Encountered
by Examinees During the Instructional
Screen Sequence for Total Group
(N = 531)

                            Number of Errors
Testees         0     1     2     3     4     5     6     7    ≥8
Number        175   150    45    48    51    22    14    18     8
Proportion    .33   .28   .08   .09   .10   .04   .03   .03   .02


Table  8  shows  the  distribution  of  the  number  of  errors  committed  during  the 
instructions.  One-third  of  the  examinees  had  no  errors  during  the  instructional 
sequence,  while  28%  had  only  one  error.  Only  2%  of  the  examinees  made  eight  or 
more  errors  while  progressing  through  the  instructional  sequence.  Mean  number 
of  errors  of  any  kind  per  examinee  was  1.56. 

Means and standard deviations for the time it took the examinees to complete
the instructional sequence are given in Table 9. The adaptive test group
required a mean of 10.60 minutes to complete the sequence, while the conventional
test group required a mean of 10.64 minutes. The mean time to complete the
instructional sequence for the total group was 10.62 minutes.


Table 9

Means and Standard Deviations
in Minutes for Time Required
to Complete Instructional
Sequence

Group            Mean     SD
Adaptive        10.60   3.96
Conventional    10.64   4.27
Total           10.62   4.12


DISCUSSION  AND  CONCLUSIONS 


Reliability  and  Validity 

The adaptive tests were substantially more parallel than the conventional
tests, which may have affected the alternate forms reliability correlations for
the conventional tests. For almost all test lengths, score means on the conventional
tests were significantly different from each other, whereas significant
differences in score means were generally not observed for the adaptive tests.
The adaptive tests achieved an alternate forms reliability correlation of .80
after only 9 items; the conventional tests required 17 items to achieve the same
reliability. Also, for all test lengths up to 19 items the adaptive tests had
significantly higher reliabilities than the conventional tests. Thus, except
for the lack of parallelism in the conventional tests, the results of this study
support theoretical predictions that fewer items are required to achieve a given
level of measurement precision using adaptive, as opposed to conventional,
tests.


The reliability of the Bayesian test scores at 30 items was only .04 higher
than that at 15 items for these tests, but was .12 higher for the conventional
tests. One reason why the reliabilities of the scores from the Bayesian tests
did not continue to increase as test length increased may have been the declining
discriminations of the items available in the item pool with increasing test
length. By contrast, there was greater similarity in the conventional test
discriminations throughout.

Correlations with the criterion test were consistently higher for the adaptive
tests than for the conventional tests. To achieve a validity correlation
of .80 required an average of 10.5 items for the adaptive test scores; however,
to achieve the same correlation, an average of 29 conventionally administered
items was required. Adaptive test score validities increased rapidly for scores
based on tests of up to 15 items in length but showed little improvement after
that. Again, this may have been attributable to few items with high discriminations
being available for selection in the last half of the adaptive test.

The lower reliabilities of the conventional tests may be one explanation
for the validity differences between testing strategies. However, when the validity
correlations were corrected for attenuation, validity differences still
favored the adaptive strategy. While the reliability differences between the
two testing strategies might have been clouded by the less parallel nature of
the conventional tests, the validity results were not dependent upon parallelism,
since validity correlations were computed separately for each form of each
test. Although differences in item discriminations might have caused validity
differences, the conventional test item discriminations were generally higher
than those of the adaptive test; observed validity differences were in the opposite
direction.

The results of this study are contrary to those of Johnson and Weiss (1980)
and Kingsbury and Weiss (1980) using somewhat similar research designs.
Kingsbury and Weiss (1980) did not employ an independent groups design, and
their examinees received 249 items, which may have introduced fatigue effects.
Johnson and Weiss (1980) did employ an independent groups design, which should
have eliminated any fatigue effects. Both of these studies used college student
volunteers; this may have restricted the range of ability and thus affected the
resulting correlations. The present study, however, investigated testing strategy
effects on Marine recruits, who represent a wider distribution of ability
than college students. Also, the items used in this study were parameterized on
samples of 980 to 2,200 recruits, which are much larger than those used in the
other two studies. Since the size of the parameterization sample is strongly
related to the accuracy of the resultant item parameters (Schmidt & Urry, 1976),
and since it should be expected that IRT-based item selection and person scoring
strategies would be sensitive to the quality of the item parameter estimates, it
is likely that differences in the quality of the item parameters led to the different
results of these studies. In addition, the experimental conventional
tests used by both Kingsbury and Weiss (1980) and Johnson and Weiss (1980) were
peaked, as opposed to rectangular, tests, which might also have affected the
results in an unknown way. The rectangular tests used in the present study better
reflect the types of ability tests currently in use in military testing
environments.

Testing  Time 

Although for some of the shorter test lengths testing times were shorter
for the adaptive than for the conventional tests, in the majority of comparisons
the adaptive tests required more testing time than the conventional tests.
These data support those of Waters (1977) and Johnson, Weiss, and Prestwood
(1981), indicating that it takes an individual slightly longer, on the average,
to respond to items on adaptive tests than to those on a conventional test.
Since items on an adaptive test are selected according to difficulty to be near
each person's ability level, the slight increase in testing time must be judged
from within the total context of the testing procedure. As was seen, the longer
testing times for the adaptive procedure resulted from individuals of high ability
receiving items of appropriate difficulty for them; however, in the conventional
test, high ability individuals received items that were much too easy for
them, as reflected in the very short response latencies.


While items that are far removed in difficulty from an individual's ability
level may require less time for a response, such items offer relatively little
that is informative of that individual's status on the trait of interest. Thus,
testing time is important only in relation to the psychometric properties of the
testing outcome. If it takes 10% longer per item for examinees to respond to
items selected by a given testing strategy, but that strategy requires only half
as many items to achieve a given level of reliability or validity, then the increased
efficiency of that procedure reduces the importance of the
differences in testing time. For example, after an average of 3.54 minutes of
testing, the adaptive group had responded to 9 items and the alternate forms
reliability was .800; yet, after the conventional group had taken 10 items in an
equivalent amount of time (3.64 minutes), this resulted in an alternate forms
reliability correlation of only .675. Similarly, after the same 9 items the
adaptive tests had an average validity of .785, but after 10 items the conventional
tests had an average validity of only .72. Thus, while the adaptive
tests required somewhat more time, on the average, to administer, they obtained
given levels of reliability and validity in less time than did the conventional
tests.

Effectiveness  of  the  Instructions 


Analysis of errors made during administration of the initial instructions
indicated that examinees adjusted quite readily to CRT-presented testing. Using
the "RUB(out)" key to change a response and using the "SHIFT" key were the only
CRT operations that generated many errors. However, even for these operations,
after the first error there were relatively few repeated errors. These results
demonstrate that previous familiarity with CRT operation is not necessary for
military recruits before undertaking a program of computer-administered adaptive
testing. The sample items were answered without difficulty by almost all of the
recruits. The majority of the instructional screens and the sample items thus
appeared to function adequately in preparing the majority of the examinees for
the tests.

Conclusions 


The results of this study supported the feasibility and psychometric
superiority of computer-administered adaptive tests as replacements for
paper-and-pencil administered conventional tests in a military testing
environment. On an item-for-item basis, the adaptive tests took slightly longer
than the conventional tests; but with testing time held constant, the adaptive
tests obtained substantially higher levels of both reliability and validity than
did the conventional tests. The data showed that to obtain equal reliabilities,
adaptive tests could administer 50% fewer items than the conventional tests;
adaptive tests could also achieve the same level of validity as the conventional
tests using only one-third the number of items, supporting earlier validity data
reported by Thompson and Weiss (1980) on college students. The data also showed
that with a realistic item pool having good distributions of item parameters,
the adaptive tests reached their maximum levels of validity after 15 items had
been administered, although reliabilities increased slowly beyond that length;
this supports the use of short adaptive tests in practical applications.
Table A

Normal Sequence of Instructional Screens,
and Resultant Error Screens


Screen Number and Contents  Error Screens


Screen  9981  9900 

The  tests  you  are  going  to  take  are  being  given  to  you  by  a 
computer.  The instructions for the tests will appear on this
screen.  You  will  be  asked  some  questions  at  the  end  of  each 
part  of  the  instructions  to  be  sure  that  you  understand  how 
to  answer  the  test  questions.  Type  your  answer  on  the 
typewriter  keyboard. 

You  must  remember  two  things  in  order  to  talk  to  the  computer: 

1.  Do  not  type  anything  until  a  question  mark  (?)  appears 
on  the  screen. 

2.  Once  you  have  typed  an  answer,  the  computer  does  not 
receive it until you press the "RETURN" key.

Now,  the  first  thing  you  must  do  is  find  the  "RETURN"  key. 

This  key  is  the  large  key  near  the  right-hand  end  of  the 
second  row  of  keys. 

Now  press  the  "SPACE  BAR"  once,  followed  by  the  "RETURN"  key, 
to  continue  the  instructions. 

Screen  9101  9901 

Sometimes  the  computer  will  ask  you  to  type  a  number  to  give 
your  answer  to  a  question. 

You  will  find  the  number  keys  on  the  top  row  of  the  keyboard. 

Just for practice, type the number "3".  Be sure to press
the "RETURN" key afterward.

Screen  9102  9902 

That's  good. 

Sometimes  the  computer  will  ask  you  to  type  a  word 
rather  than  a  number. 

For  practice,  type  the  word  "GO"  and  press  the  "RETURN"  key 
to  continue  the  instructions. 

Screen  9103  9902 

You're  doing  fine  so  far.  You  know  how  to  type  words  and 
numbers,  and  you  know  that  you  must  press  the  "RETURN"  key 
to  send  your  answer  to  the  computer. 

Suppose  you  make  a  mistake  typing  in  your  answer  to 
a  question.  You  can  correct  it  at  any  time  before  you 
press  the  "RETURN"  key. 

Type "GO" and press the "RETURN" key to find out how
to  correct  an  error. 



Screen  9985  9904 

To correct an answer, hold down the "SHIFT" key while you
press the "RUB" (stands for rub-out) key.  The "SHIFT" key
is the long gray key at either end of the bottom row of
keys.  The "RUB" key is the second key from the right-hand
end of the third row of keys.

The computer will respond with a "V* and the blinking
light will move down one row.  You may then retype your answer.

Suppose  you  typed  a 
5 

where  you  meant  4 

As  long  as  you  have  not  pressed  the  "RETURN"  key,  you  can 
correct  the  error  by  following  the  above  instructions. 

To  show  that  you  understand  how  to  change  answers, 
change the following "5" to a "4".

Screen  9105  9902 

Now  you  know  what  to  do  in  case  you  make  a  mistake. 

Sometimes  the  computer  makes  a  mistake,  too  (although  it  hates 
to  admit  it),  and  you  can’t  read  the  question  on  the  screen. 

If  this  happens  you  can  repeat  the  question  by  pressing  the 
"SPACE  BAR"  and  then  the  "RETURN"  key. 

Type "GO" and press "RETURN" to continue.

Screen  9987  9906 

Sometimes  you  may  not  know  the  answer  to  a  question  and  want 
to skip it.  To do this, hold down the "SHIFT" key and type
a question mark (?).  Since the question mark is the same
key as the slash (/), you must hold down the "SHIFT" key
while you press the "?".  (The "SHIFT" key is the long key at
the left-hand end of the bottom row of keys, or third from
the right-hand end of the bottom row.)

Now  go  ahead  and  type  a  question  mark.  Don't  forget  to 
press  the  "RETURN"  key. 



Screen  9211  9902 

The test you are about to take is a test of your ability
with words.

Each test question will appear on this screen, followed by
four or five possible answers.  There is only one correct
answer to each question.

You must choose the correct answer to each question, and type
its number on the keyboard.

You may type a "?" if you do not know the answer and do not
want to guess.

Now type "GO", then press the "RETURN" key to continue.

Screen 9212  9950

There are five different kinds of questions in this test.  9213

9216 

One  kind  of  question  you  will  be  answering  in  this  test 
is  called  "opposites". 

Some  examples  of  opposites  are: 

GOOD  Is  the  opposite  of  BAD. 

TRUTHFULNESS  is  the  opposite  of  LYING. 

Try  this  example: 

The  opposite  of  NEAR  is: 

1.  Happy 

2.  Close 

3.  Listen 

4.  Portentous 

5.  Far 

Type  a  number  from  "1"  to  "5"  and  press  "RETURN". 



Screen  9215  9950 

That's  right.  9213 

Now  let's  try  a  different  type  of  test  question.  9216 


In  this  type  of  question  you  must  choose  the  word  or  phrase 
which  completes  a  sentence  so  that  it  makes  sense. 

For example:

I thought he was asleep because his eyes were ____.

1.  dark 

2.  shut 

3.  dull 

4.  gray 

5.  heavy 

Type a number from "1" to "5" and press "RETURN".

Screen  9217 

That's  right. 

In  another  kind  of  test  question  you  should  choose  the  word 
which  means  the  same  as  the  word  in  CAPITAL  letters. 

Let's try an example.

The  word  which  means  the  same  as  PAINFUL  is: 

1.  Cup 

2.  Playful 

3.  Sore 

4.  Amputated 

5.  Smoke 

Type  a  number  from  "1"  to  "5"  and  press  "RETURN". 

Screen  9219  9950 

That's right.  9920

Sometimes  this  same  kind  of  question  looks  like  this: 

We  CAPTURED  the  enemy  agent. 

1.  Caught 

2.  Tried 

3.  Scalded 

4.  Helped 

Here again, you should type the number of the word
that means the same as the word in CAPITAL letters.

Type a number from "1" to "4" and press "RETURN".


Screen 9221  9902

That's right.

The last type of question is based on pairs of words that
are related in some way.  Your task is to decide how the
first word in CAPITAL letters is related to the second
word in CAPITAL letters.  Next look at the third word in
CAPITAL letters, then select an answer that has the same
relationship to the third word as the first two words have
to each other.

For example:

OATS is to HORSE as GAS is to CAR.

Now type "GO" and press the "RETURN" key to continue.

Screen 9222  9950

Here is a practice question for you to answer:  9223

SAILOR is to NAVY as SOLDIER is to

1.  Battle

2.  Fort

3.  Army

4.  Regiment

5.  War

Type a number from "1" to "5" and press "RETURN".

Screen 9224  9002

That's right.

You have now completed the sample questions.

To start the test, type "GO" and press "RETURN".


Table  B 

Error Screens Required in the Instructional Sequence,
and their Frequency of Use

Screen Number and Contents  Number of Times Used

Screen  9001  140 

You seem to be having trouble with the instructions or the
equipment.  Please  call  the  test  proctor. 

Screen  9035  2 

You  have  reached  the  time  limit  of  5  minutes  for 
this  screen. 

Please  call  the  proctor  for  assistance. 

Screen  9060  22 

Incorrect  input.  TRY  AGAIN. 

Screen  9061  18 

Input  is  still  incorrect.  Check  your  instruction  manual  and 
try  again. 

Screen  9213  47 

You didn't type a number from "1" to "5".

Because this is a very easy sample question,
a "?" is not allowed (although you can answer
with a "?" on the actual test questions).

Please  retype  your  answer  following  the  instructions  above. 

Screen  9214  44 

That's  not  right.  Let's  try  that  question  again. 

The  opposite  of  NEAR  is: 

1.  Happy

2.  Close

3.  Listen

4.  Portentous

5.  Far

Type a number from "1" to "5" and press "RETURN".

Screen  9216  18 

That's  not  right.  Let's  try  that  question  again. 

You  should  choose  the  answer  that  completes  the  sentence 
so  that  it  makes  sense. 

I thought he was asleep because his eyes were ____.

1.  dark

2.  shut

3.  dull

4.  gray

5.  heavy

Type a number from "1" to "5" and press "RETURN".

Screen  9218  31 

That's  not  right.  Let's  try  that  question  again. 

The  word  that  means  the  same  as  PAINFUL  is: 

1.  Cup

2.  Playful

3.  Sore

4.  Amputated

5.  Smoke

Type a number from "1" to "5" and press "RETURN".

Screen  9220  9 

That's not right.  Let's try that question again.

You  are  to  choose  that  answer  which  is  most  similar 
in  meaning  to  the  CAPITALIZED  word. 

We  CAPTURED  the  enemy  agent. 

1.  Caught

2.  Tried

3.  Scalded

4.  Helped

Type a number from "1" to "4" and press "RETURN".

Screen  9223  49 

That's not right.  Let's try that question again.

You  want  to  figure  out  how  the  first  pair  of  words  is 
related.  Then  choose  a  word  for  the  second  pair  of  words 
so  that  the  second  pair  is  related  in  the  same  way  as  the 
first  pair. 

SAILOR  is  to  NAVY  as  SOLDIER  is  to 

1.  Battle 

2.  Fort 

3.  Army 

4.  Regiment 

5.  War 

Type a number from "1" to "5" and press "RETURN".


Screen  9900  18 

You  found  the  "RETURN"  key,  but  you  typed  something  other  than 
a  space  before  you  pressed  it. 

In  order  to  do  well  on  these  tests,  it  is  important  that  you 
follow  instructions  carefully. 

Now press the "SPACE BAR" once and then the "RETURN" key,
to  continue. 

Screen  9901  20 

You didn't type the number "3".

Have  another  practice  try. 

Type  the  number  "1"  this  time,  then  press  the  "RETURN"  key. 

Screen  9902  79 

You  didn't  type  the  word  "GO". 

Please  try  again.  Type  the  word  "GO"  without  any  other 
letters or spaces, and press the "RETURN" key.

Screen  9904  214 

You  apparently  were  not  successful  in  correcting  the  error. 

Here  is  another  chance  to  practice. 

Change the following "7" to a "6".

?  7_ 

Screen  9906  121 

You  didn't  type  a  question  mark. 

Remember,  you  must: 

1.  Hold  down  the  "SHIFT"  key  and 

2.  Press the "?" key.

If  you  don't  hold  down  the  "SHIFT"  key  while  you  type  the 
question  mark,  the  computer  reads  a  slash  (/)  and  will  tell 
you  to  try  the  same  question  again. 

Now,  once  again,  type  a  question  mark. 
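
The frequencies in Table B can be tallied to see which instructional operations produced the most error screens. The sketch below simply transcribes the "Number of Times Used" column as printed above; the variable names are hypothetical, and the grouping of screens 9904 (correcting an answer with "SHIFT" and "RUB") and 9906 (typing a question mark with "SHIFT") is an illustrative choice based on the screen texts, not an analysis reported in this study.

```python
# "Number of Times Used" for each error screen, as printed in Table B.
times_used = {
    9001: 140, 9035: 2, 9060: 22, 9061: 18, 9213: 47,
    9214: 44, 9216: 18, 9218: 31, 9220: 9, 9223: 49,
    9900: 18, 9901: 20, 9902: 79, 9904: 214, 9906: 121,
}

# Total number of error-screen presentations across all screens.
total = sum(times_used.values())

# Error screens ranked from most to least frequently used.
ranked = sorted(times_used.items(), key=lambda kv: kv[1], reverse=True)

# Share of all error screens tied to the two key-combination operations.
key_operation_share = (times_used[9904] + times_used[9906]) / total

for screen, count in ranked[:3]:
    print(f"Screen {screen}: used {count} times")
print(f"Screens 9904 and 9906 together: {key_operation_share:.1%} of {total}")
```

This kind of tally is consistent with the observation in the text that the "RUB" and "SHIFT" operations were the main sources of errors during the instructions.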
Table D

Items in the Two Forms of the Conventional Test in Order of
Presentation by Form, Biserial Correlations (Rbis),
Point-Biserial Correlations (Rptbis), Proportion Correct
(Diff), and IRT Parameters (a, b, c)

Form and                            IRT Parameter Estimates
Item       Rbis   RptBis   Diff       a        b       c

Form 1
1893        .82     .53     .77     1.45     -.95     .17
1923        .96     .23     .30     3.39     1.26     .13
1888        .82     .52     .84     1.45    -1.45     .05
1904        .87     .58     .40     1.77      .36     .11
1705        .72     .34     .92     1.03    -1.91     .13
1828        .75     .47     .83     1.15    -1.52     .06
1731        .84     .63     .57     1.52     -.06     .05
1816        .87     .50     .42     1.71      .64     .19
1890        .75     .48     .82     1.14    -1.49     .06
1793        .92     .34     .11     2.41     1.77     .04
1726        .78     .37     .93     1.26    -1.84     .09
1919        .83     .52     .53     1.50      .12     .20
1723        .79     .56     .70     1.28     -.50     .06
1748        .89     .38     .08     1.92     1.91     .04
1886        .72     .44     .84     1.03    -1.67     .06
1843        .77     .53     .72     1.22     -.90     .09
1768        .69     .36     .89      .94    -1.75     .08
1905        .82     .46     .31     1.46      .88     .14
1806        .81     .52     .82     1.41    -1.20     .06
1811        .83     .50     .53     1.46      .32     .24
1791        .83     .57     .42     1.48      .56     .11
1713        .74     .46     .83     1.09    -1.18     .06
1783        .82     .58     .63     1.41     -.14     .12
1778        .77     .53     .73     1.22     -.60     .10
1894        .77     .42     .89     1.-0    -1.94     .07
1796        .65     .31     .91      .86    -1.91     .23
1716        .79     .57     .64     1.31     -.25     .08
1917        .75     .47     .82     1.14    -1.51     .06
1889        .75     .48     .77     1.03    -1.01     .18
1870        .80     .46     .26     1.34     1.03     .11

Mean        .80     .47     .64     1.42     -.50     .11
SD          .07     .09     .25      .49     1.17     .06
Table D, continued

[Form 2 entries are not legible in the source; only the column headings
RptBis and IRT Parameter Estimates (b) survive.]
Table E

Descriptive Statistics for the Scores on the Bayesian Adaptive Tests,
t Values for Differences in Means and Variances Between the Forms,
and Correlations Between Scores on the Two Forms,
for Test Lengths of 1 to 30 Items

                   Form 1                         Form 2                 t Values
Test             Skew-    Kur-                  Skew-    Kur-                 Vari-
Length  Mean  SD   ness   tosis     Mean  SD     ness   tosis      Mean      ances      r

  1     -.05  .64  -.02   -2.02     -.18  .63     .25   -1.96      3.10**      .34    .451
  2     -.02  .71   .05    -.85     -.10  .77     .07   -1.13      1.72      -1.37    .505
  3      .03  .77   .13     .13     -.06  .79     .08    -.41      1.92       -.58    .571
  4      .05  .81  -.02    -.12     -.05  .84     .04    -.30      2.41       -.98    .672
  5      .06  .84  -.02    -.04     -.00  .84     .04    -.21      1.50        .00    .733
  6      .05  .84  -.11    -.28      .04  .85    -.03    -.30       .46       -.26    .751
  7      .06  .84  -.12    -.27      .04  .85    -.11    -.37       .76       -.10    .777
  8      .07  .85  -.05    -.26      .03  .86    -.15    -.34      1.40       -.10    .787
  9      .08  .86  -.07    -.36      .02  .87    -.07    -.30      1.77       -.23    .800
 10      .08  .87  -.05    -.31      .02  .87    -.04    -.20      1.52       -.06    .808
 11      .07  .87  -.05    -.26      .02  .87     .00    -.18      1.55       -.07    .822
 12      .07  .88  -.08    -.27      .02  .87    -.04    -.12      1.56        .27    .833
 13      .07  .88  -.08    -.28      .02  .86    -.05    -.10      1.56        .52    .845
 14      .07  .88  -.09    -.22      .02  .86    -.06    -.15      1.66        .64    .853
 15      .06  .88  -.10    -.26      .02  .87    -.06    -.19      1.57        .50    .858
 16      .06  .88  -.10    -.25      .02  .87    -.01    -.21      1.51        .40    .860
 17      .07  .89  -.10    -.23      .02  .87     .01    -.22      1.49        .77    .864
 18      .07  .89  -.08    -.25      .02  .87    -.00    -.21      1.68        .67    .869
 19      .07  .89  -.09    -.23      .03  .87    -.01    -.17      1.48        .60    .871
 20      .06  .89  -.07    -.23      .03  .87    -.02    -.15      1.05        .57    .875
 21      .06  .89  -.08    -.24      .03  .87    -.03    -.12       .99        .69    .877
 22      .06  .89  -.06    -.25      .04  .88    -.03    -.14      1.08        .67    .884
 23      .06  .89  -.07    -.27      .04  .88    -.04    -.15       .80        .47    .887
 24      .06  .89  -.06    -.26      .04  .88    -.06    -.13       .84        .44    .887
 25      .06  .89  -.09    -.25      .04  .87    -.05    -.14       .78        .72    .888
 26      .06  .90  -.08    -.26      .04  .88    -.05    -.15       .62        .80    .889
 27      .06  .90  -.08    -.24      .04  .88    -.05    -.15       .66        .85    .891
 28      .06  .90  -.09    -.23      .04  .88    -.05    -.16       .65        .65    .893
 29      .05  .90  -.07    -.24      .04  .88    -.05    -.17       .54        .65    .894
 30      .06  .90  -.08    -.24      .04  .88    -.04    -.18       .59        .78    .897

*Differences statistically significant at p ≤ .05.
**Differences statistically significant at p ≤ .01.
Table G

Pearson Product-Moment Correlations of Scores on Forms 1 and 2
of the Adaptive Test (A1, A2) and the Conventional Test (C1, C2)
with Number-Correct Scores on the 50-Item Criterion Test,
and Results of Tests of the Significance of Differences
in Pairs of Correlations

             Adaptive          Conventional
Test                                             Significant
Length   Form 1   Form 2    Form 1   Form 2      Differences

  1       .39      .39       .28      .31
  2       .60      .53       .25      .52        * *
  3       .64      .63       .37      .61        * *
  4       .69      .68       .51      .67        * *
  5       .73      .72       .56      .67        * *
  6       .76      .74       .61      .67        * * *
  7       .77      .76       .65      .68        * * *
  8       .79      .77       .67      .70        * * *
  9       .79      .78       .70      .71        * * *
 10       .80      .79       .71      .73        * *
 11       .80      .80       .71      .76        * *
 12       .82      .81       .72      .77        * *
 13       .82      .81       .71      .77        * *
 14       .82      .82       .72      .77        * *
 15       .83      .83       .72      .77        * *
 16       .83      .82       .74      .78        * *
 17       .83      .82       .75      .78        * *
 18       .83      .82       .75      .78        * *
 19       .83      .83       .76      .78        * *
 20       .83      .83       .77      .78
 21       .83      .83       .77      .78
 22       .84      .83       .77      .78        *
 23       .84      .83       .78      .78        *
 24       .84      .83       .78      .79
 25       .84      .84       .78      .79        *
 26       .84      .84       .79      .79
 27       .84      .84       .79      .79
 28       .84      .84       .79      .80
 29       .84      .84       .79      .80
 30       .84      .84       .80      .81

*p ≤ .05.  Each asterisk marks one significant comparison among
A1 vs. C1, A1 vs. C2, A2 vs. C1, and A2 vs. C2.
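
The validity coefficients in Table G can be used to look up how many adaptive items are needed to match the validity of the full 30-item conventional test. The following is a minimal sketch of such a lookup, using the values as printed in Table G; the function and variable names are hypothetical, and this simple threshold comparison is not the significance testing reported in the table.

```python
from typing import Optional, Sequence

# Validity coefficients for test lengths 1..30, as printed in Table G.
adaptive_form1 = [.39, .60, .64, .69, .73, .76, .77, .79, .79, .80,
                  .80, .82, .82, .82, .83, .83, .83, .83, .83, .83,
                  .83, .84, .84, .84, .84, .84, .84, .84, .84, .84]
adaptive_form2 = [.39, .53, .63, .68, .72, .74, .76, .77, .78, .79,
                  .80, .81, .81, .82, .83, .82, .82, .82, .83, .83,
                  .83, .83, .83, .83, .84, .84, .84, .84, .84, .84]
conv_form1 = [.28, .25, .37, .51, .56, .61, .65, .67, .70, .71,
              .71, .72, .71, .72, .72, .74, .75, .75, .76, .77,
              .77, .77, .78, .78, .78, .79, .79, .79, .79, .80]
conv_form2 = [.31, .52, .61, .67, .67, .67, .68, .70, .71, .73,
              .76, .77, .77, .77, .77, .78, .78, .78, .78, .78,
              .78, .78, .78, .79, .79, .79, .79, .80, .80, .81]


def shortest_matching_length(validities: Sequence[float],
                             target: float) -> Optional[int]:
    """Return the shortest test length (1-based) whose validity is at
    least `target`, or None if no length reaches it."""
    for length, r in enumerate(validities, start=1):
        if r >= target:
            return length
    return None


# Adaptive length needed to match each 30-item conventional form.
match_form1 = shortest_matching_length(adaptive_form1, conv_form1[-1])
match_form2 = shortest_matching_length(adaptive_form2, conv_form2[-1])
print("Form 1:", match_form1, "adaptive items vs. 30 conventional items")
print("Form 2:", match_form2, "adaptive items vs. 30 conventional items")
```

Because the coefficients are rounded to two decimals, the lengths found this way are approximate; they are meant only to show how the item savings described in the Conclusions can be read off the table.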